In data mining, discrimination refers to the unfair treatment of people, a behavior that has been
extensively studied in the social and economic sciences. There are, however, negative perceptions
of data mining. Discrimination falls into two categories: direct and indirect. Decisions that
depend on sensitive attributes are called direct discrimination, and decisions that depend on
non-sensitive attributes that are strongly related to biased sensitive ones are called indirect
discrimination. Privacy protection has become another of the most important problems in data
mining research. To address these issues, an Efficient Association Representative Rule
Concealing (EARRC) algorithm is proposed to protect sensitive information or knowledge and to
offer privacy protection together with the classification of sensitive data. Representative rule
concealing is a privacy-preserving mechanism that hides sensitive association rules. The
objective of this paper is to reduce the alteration of the original database while ensuring that
no sensitive association rule can be obtained. The proposed method hides sensitive information
by altering the database without modifying the support of the sensitive item. EARRC is an
associative classification approach that integrates the benefits of both associative
classification and rule-based PART (Projective Adaptive Resonance Theory) classification. In
experiments, the proposed EARRC+PART classifier improves NMI by 1.06 and accuracy by 5.66
compared with existing methods.
1. INTRODUCTION

The word discrimination derives from the Latin discriminare, which means to distinguish between
things. Social and financial discrimination is the unfair treatment of people on the basis of the
group to which they belong. In research, discrimination has become an issue in credit, finance,
insurance, the labor market, education, and other human activities, and it has attracted much
attention from researchers in economics and social science. Many decision-making processes lend
themselves to discrimination, e.g., education, loan granting, health insurance, and employee
selection. For example, an automated framework may decide, from a specific set of data items
available for a customer, whether that customer should be recommended for credit or for some kind
of life insurance.

Problem: Privacy protection has become one of the most important problems in data mining. Several
privacy-preserving data mining mechanisms have been proposed; in the existing literature they are
based on either a cryptographic or a statistical method. Privacy-preserving association rule
mining protects sensitive data items from unnecessary or illegal discovery. The cryptographic
mechanism uses secure multi-party computation, which ensures strong confidentiality and accuracy.
However, the technique usually suffers from privacy and time-complexity problems. Most existing
methods for resolving discrimination issues follow a preprocessing, in-processing, or
post-processing approach. Rule-based frameworks are generally deployed in intrusion detection and
prevention systems (IDPSes) and achieve good results when the signature data is precise. Their
rules are built by a rule generator. However, a physical attack is determined in the derived
rules, and it alters them using a few previous rules.

Background: The authors of [1] explained direct and indirect system-level discrimination in
training data. The method proposed in that work extends the non-discrimination result from the
training data to data prediction. Group-level and individual-level direct discrimination were
studied. The work in [2] addressed a two-phase co-occurrence association rule mining method to
recognize implicit aspects. It comprised two stages. In the first stage, rule generation, explicit
rules were mined from the corpus for every opinion word. In the second stage, rule application,
the rule consequents (explicit attributes) were clustered to create more robust rules for each
opinion word. The work in [3] discussed a discrimination discovery approach that models the
probability distribution of a context using Bayesian networks. It computed the effect of a
protected feature in a subset of the dataset. A classification technique then corrected the
detected discrimination without using protected features in the decision process. The work in [4]
explained Data Envelopment Analysis (DEA), which ranks association rules using various criteria
such as support and confidence. The work in [5] discussed data transformation approaches such as
rule protection and rule generalization, which address direct and indirect discrimination with
numerous discriminatory items.

The work in [6] described sensitive attributes such as gender, religion, and race that influence
discriminatory decisions. The decisions were made on the basis of biased sensitive attributes and
non-sensitive attributes. The work in [7] addressed a causal Bayesian network technique that
captures discrimination based on the legally grounded situation testing methodology. The method
used causal Bayesian networks together with causal inference guidelines. The work in [8] focused
on cleaning and outsourcing training datasets, using legitimate classification rules to extract
the discriminating rules. Legitimate classification rules are used to predict intrusion, fraud, or
crime, with a strong focus on sensitive attributes. The work in [9] reviewed discrimination
measures and estimated the performance of discrimination-aware predictive models. It reviewed and
discussed measurement procedures and gave recommendations for practitioners in data mining,
machine learning, pattern recognition, and statistical modeling who are developing
non-discriminatory predictive models. The work in [10] developed a discrimination-aware data
mining (DADM) method for deriving patterns. The technique does not discriminate on “unjust
grounds” such as gender, ethnicity, or nationality.

The work in [11] illustrated concurrent chronic diseases in the course of treatment; it took two
types of comorbid datasets as input. Several popular machine learning techniques such as Logistic
Regression (LR) and Random Forest (RF) were applied to build predictive models. The work in [12]
explained the discrimination-aware association rule classifier (DAAR), which is used to filter out
discrimination issues. Discrimination-aware measurements are incorporated into the rule mining
algorithm. The work in [13] surveyed various discrimination discovery and discrimination
prevention methods to identify the features and limitations of each technique. That paper
explained antidiscrimination techniques that cover both discrimination discovery and prevention.
The work in [14] discussed evaluation results on four types of discrimination, i.e., direct,
indirect, individual-level, and group-level discrimination. The technique used causal networks to
capture the existence of discrimination patterns, which provided quantitative evidence of
discrimination in decision making. The work in [15] elaborated on the WEKA workbench, which
organizes data preprocessing tools and state-of-the-art machine learning algorithms. The system
offers a convenient graphical user interface for data exploration and supports larger setups on
distributed computing environments with configured streaming for data processing. The work in [17]
illustrated the integration of the Adaptive Weight Ranking Policy (AWRP) with intelligent
classifiers (NB-AWRP-DA and J48-AWRP-DA) through a dynamic aging feature to enhance the
classifiers' predictive power. The schemes are used to select the best subset of aspects. The work
in [18] sought the best classifiers for class-imbalanced health datasets through a cost-based
comparison of classifier performance. The uneven misclassification costs were characterized in a
cost matrix and a cost-benefit analysis. The work in [19] discussed how higher education
institutes use the WEKA tool and other data mining tools and techniques to improve student
academic performance and to prevent dropout.

Indonesian J Elec Eng & Comp Sci, Vol. 15, No. 1, July 2019: 527-534, ISSN: 2502-4752

Proposed Solution: This research designs an Efficient Association Representative Rule Concealing
(EARRC) algorithm for protecting sensitive information or knowledge and for offering privacy
protection together with the classification of sensitive data. Representative rule concealing is a
privacy-preserving mechanism that hides sensitive association rules. The objective of this method
is to reduce the alteration of the original database while ensuring that no sensitive association
rule can be obtained. The proposed method hides sensitive information by altering the database
without modifying the support of the sensitive item. The technique is intended to limit the
lost-rule and ghost-rule side effects: sensitive rules should be hidden completely without
affecting non-sensitive rules (a non-sensitive rule that is hidden unintentionally is a lost
rule), and the hiding process should not cause extra fake rules to be extracted (such a rule is a
ghost rule). It is an evolutionary mechanism for resolving compound issues that require optimal
sanitization. Degradation of information is computed along two dimensions. The first measures the
protection of confidential information, and the second measures the loss of functionality. The
proposed work discusses an effective mechanism for privacy preservation and discrimination
prevention. EARRC is an associative classification approach that integrates the benefits of both
associative classification and rule-based PART classification. PART is a rule-based classifier
used for prediction. The method supports discrimination prevention and improves accuracy. Its
objectives are:

a. To develop an Efficient Association Representative Rule Concealing (EARRC) algorithm that
   protects sensitive information or knowledge and hides sensitive association rules.
b. To offer privacy preservation together with the prediction of sensitive data.
c. To alter the original database as little as possible while ensuring that no sensitive
   association rule can be obtained.
d. To compute the confidential information protection and the lost functionality.
e. To improve the Normalized Mutual Information (NMI) and Accuracy compared with existing
   methods.

The rest of the paper is organized as follows. Section 2 describes the proposed methodology with
implementation details. Section 3 discusses the experimental results and a comparative study with
conventional techniques. Section 4 concludes the paper.




2. RESEARCH METHOD 

This research proposes an Efficient Association Representative Rule Concealing (EARRC) algorithm
that protects sensitive information or knowledge by hiding sensitive association rules and offers
privacy protection together with sensitive data prediction. EARRC is divided into the following
modules: data loading, data preprocessing, frequent item set generation, rule generation,
classification, and the EARRC algorithm itself. The workflow of the proposed system is illustrated
step by step in Figure 1.


2.1. Implementation Pre-processing Steps 
2.1.1 Loading Data 

Loading data is the process of browsing for and loading the biased data set into the proposed
framework. The loaded metadata comprises the file name, file size, time, total number of
attributes, and total number of records. The method also identifies the sensitive attributes,
listing each attribute's column, name, and description.


2.1.2 Preprocessing and Data Cleaning 

The method processes data with discriminatory biases, which comprise the original sensitive
information. It transforms the data so that no unfair decision rules can be extracted from the
transformed sensitive information. The method thereby obtains discrimination-free information, to
which standard data mining algorithms can be applied. The sensitive information transformation and
frequent item set generalization are adapted from privacy preservation using the EARRC
methodology.


2.1.3 Frequent Item set Generation 

The EARRC algorithm extracts the frequently occurring item sets in a given biased data set. The
input is a set of transactions with sensitive items, and the output is the sensitive items subject
to a confidence constraint on the item sets. It generates a set of candidate item sets and counts
their occurrences.
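
Candidate generation and counting can be illustrated with a minimal Apriori-style sketch. This is not the authors' implementation (which is written in Java); the function name `frequent_itemsets`, the `min_support` parameter, and the toy transactions are assumptions made for illustration only.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset whose support
    (fraction of transactions containing it) is >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item that occurs in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the transactions that contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates, then prune
        # any candidate that has an infrequent k-item subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical toy data: each transaction is a set of attribute=value items.
toy = [
    {"gender=F", "edu=HS", "income=low"},
    {"gender=F", "edu=HS", "income=low"},
    {"gender=M", "edu=BSc", "income=high"},
    {"gender=F", "edu=BSc", "income=high"},
]
result = frequent_itemsets(toy, min_support=0.5)
```

With a minimum support of 0.5, the item `gender=M` (support 0.25) is discarded, while the three-item set `{gender=F, edu=HS, income=low}` is kept.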


2.1.4 Representative Rule generation 

Representative rule generation produces privacy-improved association rules for each frequent item
set, where each rule is a binary partition of a frequent item set. The method considers reliable,
sensitive information and creates a general statement for each item. The EARRC technique derives
common concepts by abstracting the general properties (name, country, profession, date of birth,
income, address, etc.) from the training dataset. The method applies to the nominal attributes of
the biased data set and transforms each numeric feature into ranges of values.
Figure 1. Workflow diagram of the proposed system
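
Treating each rule as a binary partition of a frequent item set can be sketched as follows. This is an illustrative example under assumptions, not the paper's code: the helper names `support` and `rules_from_itemset`, the `min_confidence` threshold, and the toy data are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rules_from_itemset(itemset, transactions, min_confidence):
    """Split itemset into every (antecedent, consequent) pair and keep the
    rules whose confidence supp(itemset) / supp(antecedent) is high enough."""
    itemset = frozenset(itemset)
    whole = support(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            lhs_support = support(lhs, transactions)
            if lhs_support == 0:
                continue  # antecedent never occurs, confidence undefined
            conf = whole / lhs_support
            if conf >= min_confidence:
                rules.append((lhs, itemset - lhs, conf))
    return rules


# Hypothetical toy data.
toy = [
    frozenset({"gender=F", "edu=HS", "income=low"}),
    frozenset({"gender=F", "edu=HS", "income=low"}),
    frozenset({"gender=M", "edu=BSc", "income=high"}),
    frozenset({"gender=F", "edu=BSc", "income=high"}),
]
rules = rules_from_itemset({"gender=F", "income=low"}, toy, min_confidence=0.6)
by_lhs = {lhs: conf for lhs, rhs, conf in rules}
```

Here the two-item set yields two rules, `gender=F => income=low` and `income=low => gender=F`, each with its own confidence.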


2.1.5 Data Prediction 

Data prediction computes group assignments, or memberships, for the sensitive information
instances of the training dataset. The prediction is evaluated with reference to the original data
set. The aim of the classification mechanism is to analyze the input data set and build an
accurate model for every grouping attribute present in the sensitive information.


2.2. Efficient Association Representative Rule Concealing (EARRC) Algorithm 

The Efficient Association Representative Rule Concealing (EARRC) algorithm protects sensitive
information or knowledge by hiding sensitive association rules and offers privacy protection
together with sensitive data prediction. It operates on representative rules (RRs) that contain
sensitive data on the left-hand or right-hand side. The technique selects a rule that contains
sensitive data from the set of RRs. It then selects the database transactions that include all the
sensitive data in that RR. The proposed EARRC method hides the sensitive data by altering the
database without modifying the support of the sensitive data.

Association rules are mined from a given dataset. The RRs are a set of rules from which all
association rules can be derived without accessing the data set. The cover operator C derives a
set of association rules from a given association rule. The process of creating representative
rules is decomposed into two sub-procedures: frequent item set generation, and RR generation from
the frequent item sets. Let $Z$ be a frequent item set and $\emptyset \neq A \subset Z$. The
association rule $A \Rightarrow Z \setminus A$ is a representative rule if there is no association
rule $A \Rightarrow Z' \setminus A$ with $Z \subset Z'$, and there is no association rule
$A' \Rightarrow Z \setminus A'$ with $A \supset A'$. The set of representative rules (RR) for a
given set of association rules (AR) is defined in (1).


$RR = \{\, r \in AR \mid \nexists\, r' \in AR,\ r' \neq r \text{ and } r \in C(r') \,\}$ (1)


Here C is the cover operator. Every rule in RR is called a representative association rule. No
representative rule belongs to the cover of another association rule. An imbalanced biased
dataset, a minimum support, and a minimum confidence are provided as input to the algorithm.
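
The definition in (1) can be illustrated with a short sketch. It assumes the commonly used definition of the cover operator, in which the cover of X => Y contains every rule (X ∪ Z) => V with Z and V disjoint subsets of Y and V non-empty; the text above does not spell the operator out, so this definition, the function names, and the toy rule set are assumptions.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items (including the empty set) as a frozenset."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def cover(rule):
    """Assumed cover operator: rules (X | Z) => V with Z, V disjoint
    subsets of Y and V non-empty, for a rule X => Y."""
    lhs, rhs = rule
    covered = set()
    for v in subsets(rhs):
        if not v:
            continue
        for z in subsets(rhs - v):
            covered.add((lhs | z, v))
    return covered


def representative_rules(rules):
    """Keep the rules that are not in the cover of any other rule, as in (1)."""
    rules = set(rules)
    return {
        r for r in rules
        if not any(r in cover(other) for other in rules if other != r)
    }


# Hypothetical rule set over items a, b, c; each rule is (antecedent, consequent).
ar = {
    (frozenset("a"), frozenset("bc")),
    (frozenset("a"), frozenset("b")),
    (frozenset("a"), frozenset("c")),
    (frozenset("ab"), frozenset("c")),
    (frozenset("ac"), frozenset("b")),
}
rr = representative_rules(ar)
```

In this toy case the single rule `a => bc` covers the other four, so it is the only representative rule.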

The pseudocode of the proposed algorithm is given below.

Input: S, an imbalanced biased data set; mi_support; mi_confidence; and F, a set of sensitive
data items.

Output: A transformed database S' in which the representative rules (RR) that include items of F
are hidden, together with the Normalized Mutual Information (NMI) and Accuracy.

Procedure:
Start;
Compute the item sets from data set S;
For every sensitive data item f ∈ F
    If f is a small item set (support below mi_support) then
        F = F - {f};
If F is empty then EXIT;
Select the representative rules RR from the data set;
Arrange RR in descending order by supported items;
Choose r (an association rule) from RR;
Compute the confidence conf of rule r;
Repeat
    If conf > mi_confidence then
        // modify the placement of the sensitive item f
        Find T = {t in S | t completely supports r};
        If t comprises the attributes and f then
            Eliminate f from t;
        Else
            Find T = {t in S | t does not support or only partially supports the attributes};
            Add f to t;
        End If
    End If
    Select the next rule r from RR;
    Compute the confidence conf of r;
Until (RR is empty);
If conf > mi_confidence then
    Update S with the new item transaction t;
    Calculate and visualize the NMI and Accuracy;
Else
    Report failure to compute and visualize the NMI and Accuracy;
End If
End
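
The core idea of the procedure, lowering a rule's confidence while leaving the support of the sensitive item unchanged, can be sketched in a simplified form. This is not the authors' implementation: it handles a single rule `lhs => item`, and the function names, the choice of which transactions to alter, and the toy data are assumptions.

```python
def confidence(lhs, item, transactions):
    """Confidence of the rule lhs => item over a list of item sets."""
    has_lhs = [t for t in transactions if lhs <= t]
    if not has_lhs:
        return 0.0
    return sum(1 for t in has_lhs if item in t) / len(has_lhs)


def hide_rule(transactions, lhs, item, min_confidence):
    """Relocate item until conf(lhs => item) is no longer above min_confidence.

    Each step removes item from one transaction that fully supports the rule
    and adds it to one transaction that does not contain lhs, so the total
    number of transactions containing item stays the same.
    """
    lhs = frozenset(lhs)
    data = [set(t) for t in transactions]  # work on a copy
    while confidence(lhs, item, data) > min_confidence:
        donors = [t for t in data if lhs <= t and item in t]
        receivers = [t for t in data if not lhs <= t and item not in t]
        if not donors or not receivers:
            break  # cannot relocate further without changing item support
        donors[0].remove(item)
        receivers[0].add(item)
    return data


# Hypothetical toy data: hide the rule edu=HS => income=low.
original = [
    {"edu=HS", "income=low"},
    {"edu=HS", "income=low"},
    {"edu=HS", "income=low"},
    {"edu=HS"},
    {"edu=BSc"},
    {"edu=BSc"},
]
sanitized = hide_rule(original, {"edu=HS"}, "income=low", min_confidence=0.5)
before = confidence(frozenset({"edu=HS"}), "income=low", original)
after = confidence(frozenset({"edu=HS"}), "income=low", sanitized)
```

Adding the item to other transactions may itself create new rules, which is the ghost-rule side effect; this sketch does not check for it.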


3. RESULTS AND ANALYSIS 
3.1. Programming Environment 

The implementation runs on an Intel 16th processor with 8 GB RAM and 500 GB memory under the
Windows 7 Ultimate operating system. The proposed framework is developed in the Java programming
language (JDK 1.8, NetBeans 8.0.2) with a MySQL database. The proposed technique uses the WEKA
library with the data sets.


3.2. Data Set 

The paper uses two real datasets, Adult and Dutch Census, from the UCI Repository of Machine
Learning Databases. These two datasets are commonly used in discrimination research. The Adult
dataset comprises 48861 tuples (after eliminating tuples with missing values) with 14 attributes.
The analytical task is to classify people into high and low salary classes. It is well known that
various attributes in the Adult dataset are weakly related to gender, for example work class,
education, job, race, capital loss, and native country.

Dutch Dataset: For the Dutch dataset, the class variable describes whether a person's occupation
is high income or low income, and the sensitive attribute is the person's gender. The size of the
dataset is 32,584, and the number of non-sensitive attributes is 10. Note that all attributes are
categorical and were transformed into multiple binary attributes by the 1-of-K method.
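
The 1-of-K transformation of categorical attributes into binary attributes can be sketched as follows. The attribute names and records are hypothetical and are not taken from the Dutch dataset.

```python
def one_of_k(records, attributes):
    """Expand each categorical attribute into one binary column per value."""
    columns = []
    for attr in attributes:
        for value in sorted({r[attr] for r in records}):
            columns.append((attr, value))
    names = [f"{attr}={value}" for attr, value in columns]
    rows = [
        [1 if r[attr] == value else 0 for attr, value in columns]
        for r in records
    ]
    return names, rows


# Hypothetical records with two categorical attributes.
records = [
    {"sex": "F", "edu": "HS"},
    {"sex": "M", "edu": "BSc"},
    {"sex": "F", "edu": "BSc"},
]
names, rows = one_of_k(records, ["sex", "edu"])
```

Each record ends up with exactly one 1 per original attribute, so two categorical attributes with two values each become four binary columns.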


3.3. Normalized Mutual Information (NMI) 

NMI measures the agreement between two labelings on a scale from 0 (no mutual information) to 1
(perfect correlation). It is obtained by normalizing the mutual information into the range
[0, 1]. The Normalized Mutual Information is calculated as in (2):
$NMI(Y, S) = \frac{I(Y; S)}{\sqrt{H(Y)\,H(S)}}$ (2)


where I(·;·) and H(·) denote mutual information and entropy, respectively, Y is the class labels,
and S is the cluster labels.
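
A minimal sketch of an NMI computation is given below. It assumes the geometric-mean normalization, in which the mutual information is divided by the square root of H(Y)·H(S); this is one common choice and may differ from the normalization used in the experiments. The function names are illustrative.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(y, s):
    """Mutual information between two label sequences of equal length."""
    n = len(y)
    joint = Counter(zip(y, s))
    py, ps = Counter(y), Counter(s)
    return sum(
        (c / n) * log((c / n) / ((py[a] / n) * (ps[b] / n)))
        for (a, b), c in joint.items()
    )


def nmi(y, s):
    """Mutual information normalized by sqrt(H(y) * H(s)) (assumed form)."""
    denom = sqrt(entropy(y) * entropy(s))
    return mutual_information(y, s) / denom if denom > 0 else 0.0


identical = nmi([0, 0, 1, 1], [0, 0, 1, 1])    # labelings agree completely
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # labelings carry no shared information
```

The two toy calls show the end points of the scale: identical labelings give a value of 1 (up to rounding), and independent labelings give 0.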


3.4. Accuracy 

Accuracy is the ratio of correct (true) predictions to the total number of instances evaluated.
It is calculated as in (3):


$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$ (3)


TP is the number of true positives (correctly classified positive instances), and TN is the number
of true negatives. FN is the number of false negatives, and FP is the number of false positives.

The proposed EARRC system is compared with the following existing methods: Naive Bayes (NB) [16],
Logistic Regression (LR) [16], and Support Vector Machine (SVM) [16]. The proposed EARRC protects
sensitive information or knowledge. It also hides sensitive association rules and provides privacy
protection together with the classification of sensitive data. The proposed EARRC algorithm is
integrated with the rule-based PART classifier to improve the Normalized Mutual Information (NMI)
and Accuracy.
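
The accuracy measure in (3) can be checked with a small worked example. The labels below are made up for illustration, and the function name and the `positive` parameter are assumptions.

```python
def accuracy(y_true, y_pred, positive=1):
    """Accuracy = (TP + TN) / (TP + FP + TN + FN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical labels: TP = 2, TN = 1, FP = 1, FN = 1.
acc = accuracy([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

With two true positives and one true negative out of five instances, the accuracy is 3/5.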

Figures 2 and 3 compare the proposed EARRC+PART technique with conventional techniques in terms of
Normalized Mutual Information (NMI) and Accuracy. The proposed EARRC+PART algorithm is evaluated
against the Naive Bayes (NB), Logistic Regression (LR), and Support Vector Machine (SVM) [16]
methods on both measures to estimate its efficiency. Naive Bayes is a supervised learning
classifier that uses Bayesian inference and the (often incorrect) assumption that the features are
conditionally independent given the class. It yields lower Accuracy and NMI than the proposed
EARRC+PART classifier. Logistic Regression is used to explain data and to describe the
relationship between one dependent binary variable and one or more nominal, ordinal, interval, or
ratio-level independent variables. It is the nearest competitor in terms of accuracy, but it fails
to maintain NMI. SVM is the nearest competitor to the proposed EARRC+PART method when NMI and
Accuracy are considered together. SVMs are supervised learning models with associated learning
algorithms that analyze data for classification and regression. SVM consumes more time for data
processing and does not guarantee accuracy. The EARRC+PART algorithm offers high NMI and Accuracy:
it improves NMI by 1.06 and Accuracy by 5.66. Finally, the paper claims that the proposed
EARRC+PART methodology performs best on every evaluation metric and the respective input
parameters.
Figure 2. Normalized Mutual Information (NMI) for the Adult and Dutch data sets
Figure 3. Accuracy for the Adult and Dutch data sets


4. CONCLUSION 

The paper presents an Efficient Association Representative Rule Concealing (EARRC) algorithm that
protects sensitive information or knowledge and provides privacy protection together with the
classification of sensitive data. The objective is to minimize the alteration of the original
database while ensuring that no sensitive association rule can be obtained. The proposed method
hides sensitive information by altering the database without modifying the support of the
sensitive item. It operates on representative rules (RRs) that contain sensitive data on the
left-hand or right-hand side, and it selects a rule that contains sensitive data from the set of
RRs. Representative rule creation consists of two sub-procedures: frequent item set generation,
and RR generation from the frequent item sets. The proposed EARRC+PART improves NMI by 1.06 and
Accuracy by 5.66. Finally, the paper claims that the proposed EARRC+PART methodology performs best
on every evaluation metric and the respective input parameters.

In the future, this work can be extended to apply the discrimination prevention technique with
content-based privacy in online social networks (OSNs) using the Hadoop environment.
Discrimination also occurs in OSNs, and further work is required there.